How to go public with private data
In current work with Stephen Jenkins, we have reached the submission stage and found that many journals require a replication package along with the paper.
Providing the code for the models we estimated, including code developed by other authors, is relatively simple. The problem, which many people may face, is how to distribute data that we are not allowed to share for privacy or proprietary reasons.
As a matter of fact, on this particular project only Stephen has seen the data, whereas I have worked from afar, primarily on the code that estimates the new models (if you are interested in the research, references to our previous papers are listed below).
So now that it is time for the "big" paper to be published, we need a strategy to construct a synthetic dataset that satisfies all privacy protection constraints while still preserving the moment structure we care about, as well as the moments we may not be so interested in but that may be of interest to other people.
With this idea in mind, I came up with a simple strategy that may help do just that: an application of multiple imputation. This may not be THE best method, so I'll be happy to hear any comments.
To better describe how the method works, I will use a dataset that is readily available online. It is an excerpt of the Swiss Labor Market Survey 1998, which is provided as the example dataset for the command -oaxaca- (Jann 2008).
The problem
Assume that you have signed a confidentiality agreement to work with the Swiss survey data. You are ready to submit your work, but you are asked to provide a replication package with the code that produces your tables and the dataset itself.
Since you cannot share the data, you propose instead to provide 5 synthetic datasets, so people can apply your code and reach similar (if not the same) conclusions as in your main paper. Here is a piece of code you could use for that.
. ** Assumption. You have a dataset that you want to use
. clear all
. use http://fmwww.bc.edu/RePEc/bocode/o/oaxaca.dta, clear
(Excerpt from the Swiss Labor Market Survey 1998)
. misstable summarize
                                                               Obs<.
                                                +------------------------------
               |                                | Unique
      Variable |     Obs=.     Obs>.     Obs<.  | values        Min         Max
  -------------+--------------------------------+------------------------------
        lnwage |       213               1,434  |   >500    .507681    5.259097
         exper |       213               1,434  |   >500          0    49.16667
        tenure |       213               1,434  |    323          0    44.83333
          isco |       213               1,434  |      9          1           9
  -----------------------------------------------------------------------------
Four variables have missing data: wages (lnwage), experience, tenure, and ISCO. They are missing when LFP=0.
Now, I suggest creating one variable that will act as a "seed" for recreating the synthetic datasets. It is just a uniform random variable ranging from 0 to 100. I also create an ID variable.
. gen id = _n
. set seed 10101
. gen seed = runiform(0,100)
The next step is to decide how large the synthetic dataset will be. The obvious answer is to create a dataset with the same number of observations as the original (1,647), but this can be adjusted if you want another sample size. So I'll expand the dataset by adding 1,647 copies of observation 1 (expand 1648 leaves 1,648 copies in total, the original plus 1,647 new ones). The option gen(tag) marks the new observations with tag=1:
. expand 1648 in 1, gen(tag)
(1,647 observations created)
You can now set all variables to missing for the observations with tag=1:
. foreach i of varlist lnwage educ exper tenure isco female lfp age single married divorced kids6 kids714 wt {
2. replace `i'=. if tag==1
3. }
(1,647 real changes made, 1,647 to missing)
(1,647 real changes made, 1,647 to missing)
(1,647 real changes made, 1,647 to missing)
(1,647 real changes made, 1,647 to missing)
(1,647 real changes made, 1,647 to missing)
(1,647 real changes made, 1,647 to missing)
(1,647 real changes made, 1,647 to missing)
(1,647 real changes made, 1,647 to missing)
(1,647 real changes made, 1,647 to missing)
(1,647 real changes made, 1,647 to missing)
(1,647 real changes made, 1,647 to missing)
(1,647 real changes made, 1,647 to missing)
(1,647 real changes made, 1,647 to missing)
(1,647 real changes made, 1,647 to missing)
You will also need to redraw the "seed" variable for those observations:
. replace seed = runiform(0,100) if tag==1
(1,647 real changes made)
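We also need to set LFP, because wages, experience, tenure, and ISCO are only observed for labor force participants. Here LFP is drawn with probability 0.87, roughly the share of participants in the original data (1,434 of 1,647):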
. replace lfp = runiform()<.87 if tag==1
(1,647 real changes made)
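To make the preparation steps concrete outside Stata, here is a minimal Python sketch of what the blank synthetic records look like before imputation. The function name make_template and the shortened variable list are my own illustrative choices, and Python's random number stream is not the same as Stata's, so the draws will not match the log above.

```python
import random

def make_template(n_rows, p_lfp=0.87, rng_seed=10101):
    """Build n_rows blank synthetic records: only `seed` and `lfp` are filled in."""
    rng = random.Random(rng_seed)
    # Shortened, illustrative variable list (the Stata loop above blanks 14 variables).
    variables = ["lnwage", "educ", "exper", "tenure", "isco", "female", "age"]
    rows = []
    for _ in range(n_rows):
        row = {name: None for name in variables}  # None plays the role of Stata's "."
        row["tag"] = 1                             # marks a synthetic record
        row["seed"] = rng.uniform(0, 100)          # analogous to runiform(0,100)
        row["lfp"] = int(rng.random() < p_lfp)     # analogous to runiform()<.87
        rows.append(row)
    return rows

template = make_template(1647)
share_lfp = sum(r["lfp"] for r in template) / len(template)
print(len(template), round(share_lfp, 3))
```

The point of the sketch is that the synthetic records start with no information from the confidential data at all: everything except the two random variables is missing and will be filled in by the imputation step.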
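The next step is to create the multiply imputed datasets. I believe the best strategy here is to use "pmm" (predictive mean matching), because it fills in each missing value with a value borrowed from an observed record, so the imputed data keep the observed distribution and data types. So first mi set the data, and register all variables to be imputed: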
. mi set wide
. mi register impute lnwage educ exper tenure isco female age single married kids6 kids714 wt
Then simply impute all variables using chained pmm. Just make sure that none of the variables are collinear (here collinearity exists between single, married, and divorced, which is why divorced is left out), and that variables with structural missing data are specified separately.
Notice as well that the only explanatory variables are "seed" (fully random) and LFP (also random).
. mi impute chain (pmm, knn(100)) educ female age single married kids6 kids714 wt (pmm if lfp==1, knn(100) ) lnwage exper tenure isco = seed lfp, add(5)
note: missing-value pattern is monotone; no iteration performed
Conditional models (monotone):
educ: pmm educ seed lfp , knn(100)
female: pmm female educ seed lfp , knn(100)
age: pmm age female educ seed lfp , knn(100)
single: pmm single age female educ seed lfp , knn(100)
married: pmm married single age female educ seed lfp , knn(100)
kids6: pmm kids6 married single age female educ seed lfp , knn(100)
kids714: pmm kids714 kids6 married single age female educ seed lfp , knn(100)
wt: pmm wt kids714 kids6 married single age female educ seed lfp , knn(100)
lnwage: pmm lnwage wt kids714 kids6 married single age female educ seed lfp if lfp==1, knn(100)
exper: pmm exper lnwage wt kids714 kids6 married single age female educ seed lfp if lfp==1, knn(100)
tenure: pmm tenure exper lnwage wt kids714 kids6 married single age female educ seed lfp if lfp==1, knn(100)
isco: pmm isco tenure exper lnwage wt kids714 kids6 married single age female educ seed lfp if lfp==1, knn(100)
Performing chained iterations ...
Multivariate imputation Imputations = 5
Chained equations added = 5
Imputed: m=1 through m=5 updated = 0
Initialization: monotone Iterations = 0
burn-in = 0
educ: predictive mean matching
female: predictive mean matching
age: predictive mean matching
single: predictive mean matching
married: predictive mean matching
kids6: predictive mean matching
kids714: predictive mean matching
wt: predictive mean matching
lnwage: predictive mean matching
exper: predictive mean matching
tenure: predictive mean matching
isco: predictive mean matching
------------------------------------------------------------------
| Observations per m
|----------------------------------------------
Variable | Complete Incomplete Imputed | Total
-------------------+-----------------------------------+----------
educ | 1647 1647 1647 | 3294
female | 1647 1647 1647 | 3294
age | 1647 1647 1647 | 3294
single | 1647 1647 1647 | 3294
married | 1647 1647 1647 | 3294
kids6 | 1647 1647 1647 | 3294
kids714 | 1647 1647 1647 | 3294
wt | 1647 1647 1647 | 3294
lnwage | 1434 1458 1458 | 2892
exper | 1434 1458 1458 | 2892
tenure | 1434 1458 1458 | 2892
isco | 1434 1458 1458 | 2892
------------------------------------------------------------------
(complete + incomplete = total; imputed is the minimum across m
of the number of filled-in observations.)
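For readers who have not used pmm before, the following Python sketch shows the core idea with a single predictor. It is a simplification under stated assumptions, not Stata's implementation: among other things, mi impute pmm also draws the regression coefficients from their posterior distribution, which is skipped here. The names fit_line and pmm_impute, and the toy education values, are made up for illustration.

```python
import random

def fit_line(x, y):
    """Ordinary least squares for y = a + b*x with a single predictor."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx if sxx > 0 else 0.0
    return my - b * mx, b

def pmm_impute(x_obs, y_obs, x_mis, k=100, rng=None):
    """Simplified predictive mean matching: borrow an observed y from one of
    the k observed records whose predicted mean is closest to the target's."""
    rng = rng or random.Random(0)
    a, b = fit_line(x_obs, y_obs)
    pred_obs = [a + b * xi for xi in x_obs]
    imputed = []
    for xm in x_mis:
        target = a + b * xm
        # indices of the k observed records with the closest predicted mean
        nearest = sorted(range(len(y_obs)),
                         key=lambda i: abs(pred_obs[i] - target))[:k]
        imputed.append(y_obs[rng.choice(nearest)])
    return imputed

# Toy data (hypothetical values): education "observed" for 500 records,
# to be imputed for 200 blank records using only a random seed variable.
rng = random.Random(1)
educ_obs = [rng.choice([9, 10.5, 12, 15, 17.5]) for _ in range(500)]
seed_obs = [rng.uniform(0, 100) for _ in range(500)]
seed_mis = [rng.uniform(0, 100) for _ in range(200)]
educ_syn = pmm_impute(seed_obs, educ_obs, seed_mis, k=100, rng=rng)
print(set(educ_syn) <= set(educ_obs))  # True: every imputed value is an observed value
```

Because the seed is unrelated to education, the predicted means are nearly constant and the 100 nearest donors are close to a random subset of the observed records. In the chained setup above, each later variable also conditions on the variables imputed before it, which is how the joint structure is carried over.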
That's it. You now have 5 sets of variables that can be used to create distinct synthetic datasets, with a structure similar to that of the original confidential dataset, which can be used for replication and released publicly.
. forvalues i = 1/5 {
2. preserve
3. keep if tag==1
4. keep _`i'_* lfp
5. ren _`i'_* *
6. save fake_oaxaca_`i', replace
7. restore
8. }
(1,647 observations deleted)
(note: file fake_oaxaca_1.dta not found)
file fake_oaxaca_1.dta saved
(1,647 observations deleted)
(note: file fake_oaxaca_2.dta not found)
file fake_oaxaca_2.dta saved
(1,647 observations deleted)
(note: file fake_oaxaca_3.dta not found)
file fake_oaxaca_3.dta saved
(1,647 observations deleted)
(note: file fake_oaxaca_4.dta not found)
file fake_oaxaca_4.dta saved
(1,647 observations deleted)
(note: file fake_oaxaca_5.dta not found)
file fake_oaxaca_5.dta saved
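The loop above relies on the wide mi style, in which the m-th imputation of a variable is stored in a column named _m_variable. As a rough illustration of the same splitting logic, here is a Python sketch on a two-record toy example; split_wide and the toy values are hypothetical.

```python
def split_wide(rows, m, keep=("lfp",)):
    """Turn wide MI records (columns like _1_educ, _2_educ) into m datasets."""
    datasets = []
    for i in range(1, m + 1):
        prefix = f"_{i}_"
        data = []
        for row in rows:
            if row.get("tag") != 1:        # keep only the synthetic records
                continue
            new = {name: row[name] for name in keep}
            for name, value in row.items():
                if name.startswith(prefix):
                    new[name[len(prefix):]] = value  # strip the _i_ prefix
            data.append(new)
        datasets.append(data)
    return datasets

# Toy input: one original record (tag=0) and one synthetic record (tag=1).
rows = [
    {"tag": 0, "lfp": 1, "educ": 12,   "_1_educ": 12, "_2_educ": 12},
    {"tag": 1, "lfp": 1, "educ": None, "_1_educ": 9,  "_2_educ": 15},
]
fake = split_wide(rows, m=2)
print(fake)  # [[{'lfp': 1, 'educ': 9}], [{'lfp': 1, 'educ': 15}]]
```

Note that the original record is dropped from every output, so none of the saved files contains a row of the confidential data.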
Now let's see if this works by estimating a simple linear regression (LR), a conditional quantile regression (CQR) at the 10th percentile, and a Heckman model.
. frame create test
.
. frame test: {
. use http://fmwww.bc.edu/RePEc/bocode/o/oaxaca.dta, clear
(Excerpt from the Swiss Labor Market Survey 1998)
. qui:reg lnwage educ exper tenure female
. est sto m1
. qui:qreg lnwage educ exper tenure female, q(10)
. est sto m2
. qui:heckman lnwage educ exper tenure female age, selec(lfp =educ female age single married kids6 kids714) two
. est sto m3
. }
.
. forvalues i = 1/5 {
2. frame test: {
3. use fake_oaxaca_`i', clear
4.
. qui:reg lnwage educ exper tenure female
5. est sto m1`i'
6. qui:qreg lnwage educ exper tenure female, q(10)
7. est sto m2`i'
8.
. qui:heckman lnwage educ exper tenure female age, selec(lfp =educ female age single married kids6 kids714) two
9. est sto m3`i'
10. }
11. }
(Excerpt from the Swiss Labor Market Survey 1998)
(Excerpt from the Swiss Labor Market Survey 1998)
(Excerpt from the Swiss Labor Market Survey 1998)
(Excerpt from the Swiss Labor Market Survey 1998)
(Excerpt from the Swiss Labor Market Survey 1998)
. ** OLS
. esttab m1 m11 m12 m13 m14 m15, mtitle(Original Fake1 Fake2 Fake3 Fake4 Fake5)
------------------------------------------------------------------------------------------------------------
(1) (2) (3) (4) (5) (6)
Original Fake1 Fake2 Fake3 Fake4 Fake5
------------------------------------------------------------------------------------------------------------
educ 0.0848*** 0.0709*** 0.0643*** 0.0727*** 0.0640*** 0.0588***
(16.34) (14.21) (12.36) (14.67) (13.37) (12.32)
exper 0.0111*** 0.00913*** 0.00950*** 0.00995*** 0.00752*** 0.0110***
(7.22) (6.88) (6.29) (6.76) (5.16) (7.22)
tenure 0.00771*** 0.00570*** 0.00718*** 0.00644*** 0.00670*** 0.00540**
(4.10) (3.41) (3.70) (3.57) (3.75) (2.79)
female -0.0841*** -0.0398 -0.1000*** -0.0711** -0.111*** -0.0767**
(-3.35) (-1.75) (-3.97) (-2.85) (-4.60) (-3.07)
_cons 2.213*** 2.428*** 2.490*** 2.373*** 2.542*** 2.557***
(32.38) (37.91) (36.03) (35.83) (39.59) (39.51)
------------------------------------------------------------------------------------------------------------
N 1434 1458 1458 1458 1458 1458
------------------------------------------------------------------------------------------------------------
t statistics in parentheses
* p<0.05, ** p<0.01, *** p<0.001
. ** qreg 10
. esttab m2 m21 m22 m23 m24 m25, mtitle(Original Fake1 Fake2 Fake3 Fake4 Fake5)
------------------------------------------------------------------------------------------------------------
(1) (2) (3) (4) (5) (6)
Original Fake1 Fake2 Fake3 Fake4 Fake5
------------------------------------------------------------------------------------------------------------
educ 0.103*** 0.0679*** 0.0738*** 0.0881*** 0.0703*** 0.0735***
(6.21) (5.48) (5.76) (6.61) (7.69) (5.52)
exper 0.0200*** 0.00867** 0.00944* 0.0147*** 0.00769** 0.0196***
(4.06) (2.63) (2.54) (3.72) (2.76) (4.62)
tenure 0.000669 0.00326 0.0108* 0.00419 0.00531 -0.000442
(0.11) (0.79) (2.26) (0.86) (1.56) (-0.08)
female -0.151 -0.000802 -0.0706 -0.0950 -0.0970* -0.0831
(-1.87) (-0.01) (-1.14) (-1.41) (-2.10) (-1.19)
_cons 1.462*** 2.035*** 1.869*** 1.702*** 2.028*** 1.866***
(6.67) (12.81) (10.98) (9.55) (16.55) (10.32)
------------------------------------------------------------------------------------------------------------
N 1434 1458 1458 1458 1458 1458
------------------------------------------------------------------------------------------------------------
t statistics in parentheses
* p<0.05, ** p<0.01, *** p<0.001
. ** heckman
. esttab m3 m31 m32 m33 m34 m35, mtitle(Original Fake1 Fake2 Fake3 Fake4 Fake5)
------------------------------------------------------------------------------------------------------------
(1) (2) (3) (4) (5) (6)
Original Fake1 Fake2 Fake3 Fake4 Fake5
------------------------------------------------------------------------------------------------------------
lnwage
educ 0.0717*** 0.0677*** 0.0575*** 0.0701*** 0.0594*** 0.0534***
(13.13) (13.02) (10.45) (13.28) (11.91) (10.60)
exper 0.00179 0.00264 0.00108 0.00271 -0.00226 0.00180
(0.94) (1.56) (0.55) (1.51) (-1.28) (0.94)
tenure 0.00200 0.00199 0.00356 0.00184 0.00144 0.00130
(1.01) (1.14) (1.79) (0.98) (0.79) (0.66)
female -0.105*** -0.0979** -0.154*** -0.170*** -0.204*** -0.143***
(-3.59) (-3.01) (-4.42) (-5.30) (-6.75) (-4.67)
age 0.0146*** 0.00946*** 0.0113*** 0.0104*** 0.0140*** 0.0122***
(7.92) (5.90) (6.56) (6.28) (8.58) (7.16)
_cons 1.991*** 2.220*** 2.281*** 2.134*** 2.230*** 2.309***
(27.12) (30.89) (30.16) (28.82) (31.24) (32.32)
------------------------------------------------------------------------------------------------------------
lfp
educ 0.149*** 0.133*** 0.156*** 0.170*** 0.141*** 0.160***
(5.37) (4.93) (6.11) (6.70) (5.59) (6.19)
female -1.785*** -1.592*** -1.745*** -1.480*** -1.607*** -1.502***
(-11.09) (-10.11) (-10.37) (-9.87) (-10.26) (-9.71)
age -0.0388*** -0.0214*** -0.00603 -0.0270*** -0.0268*** -0.0252***
(-5.77) (-4.02) (-1.07) (-4.76) (-4.64) (-4.34)
single -0.0998 -0.583** 0.0134 -0.379 -0.483* -0.159
(-0.43) (-3.00) (0.07) (-1.91) (-2.37) (-0.74)
married -0.867*** -0.698*** -0.667*** -0.605*** -0.873*** -0.705***
(-5.48) (-4.14) (-3.97) (-3.89) (-5.13) (-4.48)
kids6 -0.716*** -0.578*** -0.399*** -0.689*** -0.584*** -0.588***
(-8.71) (-7.56) (-4.98) (-9.20) (-7.70) (-7.79)
kids714 -0.343*** -0.131* -0.177** -0.282*** -0.206** -0.207**
(-5.26) (-2.03) (-2.83) (-4.24) (-3.19) (-3.19)
_cons 3.543*** 2.645*** 1.701*** 2.438*** 2.941*** 2.450***
(7.29) (6.14) (4.11) (5.94) (6.73) (5.66)
------------------------------------------------------------------------------------------------------------
/mills
lambda -0.123 0.118 0.0737 0.258** 0.218** 0.142
(-1.88) (1.43) (0.83) (3.24) (2.94) (1.86)
------------------------------------------------------------------------------------------------------------
N 1647 1647 1647 1647 1647 1647
------------------------------------------------------------------------------------------------------------
t statistics in parentheses
* p<0.05, ** p<0.01, *** p<0.001
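One quick way to read tables like these is to average a coefficient over the synthetic datasets and compare it with the original estimate. The Python sketch below does this for the OLS coefficient on educ, using the numbers copied from the OLS table above. The choice of summary (mean, ratio, and standard deviation) is mine, not part of the original workflow.

```python
from statistics import mean, stdev

# OLS coefficients on educ, copied from the OLS table above
original = 0.0848
fakes = [0.0709, 0.0643, 0.0727, 0.0640, 0.0588]

avg = mean(fakes)            # average estimate across the 5 synthetic datasets
ratio = avg / original       # how much of the original effect is recovered
spread = stdev(fakes)        # variation between synthetic datasets

print(f"mean of synthetic estimates: {avg:.4f}")
print(f"ratio to original estimate:  {ratio:.2f}")
print(f"sd across synthetic data:    {spread:.4f}")
```

For this coefficient the synthetic estimates average about 0.066, or roughly 78% of the original 0.0848, which suggests the imputation attenuates the relationship somewhat even though sign and significance are preserved.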
What about means and covariances? Here they are for the original data and for the first two synthetic datasets:
. frame test: {
. use http://fmwww.bc.edu/RePEc/bocode/o/oaxaca.dta, clear
(Excerpt from the Swiss Labor Market Survey 1998)
. mean lnwage exper tenure educ female age single married kids6 kids714
Mean estimation Number of obs = 1,434
--------------------------------------------------------------
| Mean Std. Err. [95% Conf. Interval]
-------------+------------------------------------------------
lnwage | 3.357604 .0140235 3.330096 3.385113
exper | 13.15324 .2632213 12.6369 13.66958
tenure | 7.860937 .2144401 7.440287 8.281587
educ | 11.53696 .0639585 11.4115 11.66242
female | .4762901 .0131934 .4504096 .5021706
age | 38.83891 .2915321 38.26704 39.41079
single | .3891213 .0128794 .3638568 .4143859
married | .4700139 .0131845 .4441509 .495877
kids6 | .2182706 .0151344 .1885826 .2479586
kids714 | .2782427 .0172008 .2445013 .311984
--------------------------------------------------------------
.
. corr lnwage exper tenure educ female age single married kids6 kids714 , cov
(obs=1,434)
| lnwage exper tenure educ female age single married kids6 kids714
-------------+------------------------------------------------------------------------------------------
lnwage | .28201
exper | 1.23107 99.3553
tenure | 1.03799 47.0903 65.9418
educ | .469384 -3.24851 -.510834 5.86604
female | -.043298 -.484036 -.598583 -.14532 .249612
age | 2.05353 79.3047 54.7529 1.62913 .213554 121.877
single | -.061535 -1.71735 -1.22853 -.005669 -.001235 -2.87447 .237872
married | .044889 1.05484 .938517 .089909 -.027229 1.75406 -.18302 .249275
kids6 | .030479 -.557053 -.447353 .118061 -.034249 -.953649 -.077317 .096222 .328459
kids714 | .036036 .088118 .006038 .020763 .001368 .469835 -.099274 .100813 .018081 .424272
.
. }
. forvalues i = 1/2 {
2. frame test: {
3. use fake_oaxaca_`i', clear
4. mean lnwage exper tenure educ female age single married kids6 kids714
5.
. corr lnwage exper tenure educ female age single married kids6 kids714 , cov
6. }
7. }
(Excerpt from the Swiss Labor Market Survey 1998)
Mean estimation Number of obs = 1,458
--------------------------------------------------------------
| Mean Std. Err. [95% Conf. Interval]
-------------+------------------------------------------------
lnwage | 3.388626 .0123486 3.364403 3.412848
exper | 13.41598 .2670916 12.89206 13.93991
tenure | 7.820988 .2108657 7.407355 8.23462
educ | 11.45045 .0597336 11.33327 11.56762
female | .4677641 .0130718 .4421225 .4934056
age | 39.12826 .2939283 38.55169 39.70483
single | .3545953 .0125329 .3300108 .3791799
married | .5034294 .0130988 .4777349 .5291238
kids6 | .1920439 .0139977 .1645861 .2195017
kids714 | .3155007 .0185486 .2791158 .3518855
--------------------------------------------------------------
(obs=1,458)
| lnwage exper tenure educ female age single married kids6 kids714
-------------+------------------------------------------------------------------------------------------
lnwage | .222327
exper | 1.01525 104.011
tenure | .693356 44.8687 64.829
educ | .338334 -2.88977 -1.34832 5.20229
female | -.020663 -.375623 -.251828 -.083016 .249132
age | 1.54067 84.3489 53.0935 .107939 .253623 125.962
single | -.045191 -1.51045 -1.00048 .021702 -.008122 -2.77303 .229015
married | .033517 .992395 .849162 .073523 -.01945 1.8242 -.178636 .25016
kids6 | .015775 -.812525 -.438087 .15314 -.004787 -.966309 -.057163 .076211 .285674
kids714 | .020063 -.213007 -.197544 .013243 -.000118 .267674 -.08793 .1053 -.006411 .501626
(Excerpt from the Swiss Labor Market Survey 1998)
Mean estimation Number of obs = 1,458
--------------------------------------------------------------
| Mean Std. Err. [95% Conf. Interval]
-------------+------------------------------------------------
lnwage | 3.362503 .0134839 3.336053 3.388953
exper | 13.19293 .2599656 12.68298 13.70288
tenure | 7.572988 .1992737 7.182094 7.963882
educ | 11.51749 .0640525 11.39184 11.64313
female | .4718793 .0130783 .4462249 .4975337
age | 38.69342 .2925405 38.11957 39.26726
single | .3909465 .0127837 .3658701 .4160229
married | .4814815 .0130901 .4558041 .5071589
kids6 | .2112483 .0149838 .1818562 .2406404
kids714 | .297668 .0177171 .2629142 .3324218
--------------------------------------------------------------
(obs=1,458)
| lnwage exper tenure educ female age single married kids6 kids714
-------------+------------------------------------------------------------------------------------------
lnwage | .265087
exper | .986651 98.5347
tenure | .758004 40.7152 57.8972
educ | .346533 -4.37921 -1.28108 5.98176
female | -.039653 -.398898 -.37838 -.127854 .24938
age | 1.66528 80.5948 46.2706 .23529 .24429 124.776
single | -.05073 -1.86626 -.965353 -.035669 .010315 -2.92399 .238271
married | .035125 1.2831 .562809 .104476 -.024886 1.86151 -.188363 .249828
kids6 | .012868 -.594919 -.426548 .155534 -.031804 -1.0601 -.06823 .085589 .327341
kids714 | .013946 .323909 -.05909 .070459 -.002605 .382332 -.08488 .109154 -.006645 .457661
Conclusions
As you can see, the results are far from a perfect replication of those obtained with the original dataset. After all, we are introducing random error to create a synthetic dataset so that other people can try to replicate our work.
With those caveats in mind, what we may end up doing is creating synthetic "fake" data like these, along with two versions of the results: one based on the actual data, and another based on the synthetic dataset(s).
If you are interested in the code I used, you can get it here.
References
Jann, Ben (2008). The Blinder-Oaxaca decomposition for linear regression models. The Stata Journal 8(4): 453-479.
Jenkins, SP, Rios-Avila, F, 2021. "Measurement error in earnings data: replication of Meijer, Rohwedder, and Wansbeek's mixture model approach to combining survey and register data." J Appl Econ. 2021. Accepted Author Manuscript. https://doi.org/10.1002/jae.2811 (with Rep Files here)
Jenkins, Stephen P. & Rios-Avila, F, 2020. "Modelling errors in survey and administrative data on employment earnings: Sensitivity to the fraction assumed to have error-free earnings," Economics Letters, Elsevier, vol. 192(C).